class: center, middle, inverse, title-slide

# Chicago Taxi Data Analysis
## SCV Group
###
### 27 Oct 2021

---
class: title-slide,middle
background-image: url("pikapika.svg"), url("assets/taxi.jpeg")
background-position: 10% 90%, 100% 50%
background-size: 100px, 100% 100%
<!-- background-color: #0148A4 -->

## .black[Hi! During your Journey,]
## .black[Drive safe &]
## .white[Scroll each page please]
## .black[Enjoy! :-)]

---

## Visualise Data

.scroll-output[

First, we run a basic analysis of the dataset and look at each variable. The continuous variables were visualised with the ggpairs function in the GGally package, a powerful tool that produces a scatterplot for each pair of variables, a histogram for each single variable, and the correlation between each pair of variables.

For instance, we expect to see a correlation between trip_total and trip_miles. In the scatterplot, we can spot an almost linear trend as well as the outliers. In the correlation cell, we get a correlation coefficient of 0.467, which indicates a moderate positive association.

<img src="./assets/1.png" width="5120" />

To further visualise trip_total, the variable of main interest, we use geom_histogram in ggplot2 and carefully select the binwidth.

<img src="./assets/4.png" width="1867" />

## Boxplots: Revenue vs. Zone

As for the discrete variables, Chicago has 77 zones, which is too many to plot at once. However, since we are interested in the zones that generate the most revenue, we plot the top 10 zones. To visualise the distribution of the data and give a comprehensive comparison among zones, we use parallel boxplots.

From the plots below, it is clear that zone 76 and zone 56 generate the most revenue. They are also the two zones with the highest average revenue, so they may be the zones drivers prefer.

While plotting, we found extreme outliers with trip_total > 200.
To better visualise the data, we set the y-axis scale to exclude them.

<img src="./assets/5.png" width="1867" />

Further, we prefer the median over the mean because most of the revenue distributions are skewed.

<img src="./assets/6.png" width="5120" />

As for the dropoff area, it is obvious that zone 76 produced the most revenue.

<img src="./assets/7.png" width="1867" />

However, we prefer pickup_community_area over dropoff_community_area to answer our question, since we see the pickup zone as the location where new revenue is generated, which is what matters to drivers.

]

---

## Total Revenue vs. Zone Analysis by Maps

.scroll-output[
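
The map on this slide is driven by revenue aggregated per pickup zone. A minimal sketch of that aggregation in base R follows. The `trips` data frame and its values are invented for illustration; only the column names `pickup_community_area` and `trip_total` come from the dataset described earlier, and the original analysis may well have used different tooling.

```r
# Toy trips table: values are made up, column names follow the taxi dataset.
trips <- data.frame(
  pickup_community_area = c(76, 76, 56, 56, 8, 8, 8),
  trip_total = c(45.00, 250.00, 38.50, 41.00, 9.75, 12.00, 11.25)
)

# Total and median revenue per pickup zone.
total_by_zone  <- tapply(trips$trip_total, trips$pickup_community_area, sum)
median_by_zone <- tapply(trips$trip_total, trips$pickup_community_area, median)

zone_summary <- data.frame(
  pickup_community_area = as.integer(names(total_by_zone)),
  total_revenue = as.vector(total_by_zone),
  median_fare = as.vector(median_by_zone)
)

# Sort by total revenue (descending) and keep at most the top 10 zones.
zone_summary <- zone_summary[order(-zone_summary$total_revenue), ]
top_zones <- head(zone_summary, 10)
print(top_zones)
```

The median column sits alongside the total because the revenue distributions are skewed, so a single extreme fare (such as the 250 in the toy data) moves the total and mean far more than the median.
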

After looking at the revenue map, we wondered why some places generate more revenue than others. To explore the reason, we bring in an external dataset on amenities as well as population data from Wikipedia.

## Population

We think population might be one of the main factors affecting the total revenue of each zone. Hence we extract the population data from the Wikipedia page and join it to our original dataset. We apply a logarithmic transform to population to make the map easier to read.

After comparing the two maps, we think population might be one of the explanatory variables for total revenue.
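
The join and log transform described above might look like the following sketch. Both tables are toy stand-ins with made-up numbers (the real population figures came from Wikipedia), and the names `zone_revenue`, `zone_population` and `log_population` are assumptions rather than the original code.

```r
# Toy per-zone revenue totals (values are made up for illustration).
zone_revenue <- data.frame(
  pickup_community_area = c(8, 32, 76),
  total_revenue = c(1200.50, 950.00, 2100.25)
)

# Hypothetical population table standing in for the scraped Wikipedia data.
zone_population <- data.frame(
  pickup_community_area = c(8, 32, 76),
  population = c(100000, 40000, 13000)
)

# Left join: keep every zone from the revenue table, attach its population.
zone_data <- merge(
  zone_revenue, zone_population,
  by = "pickup_community_area", all.x = TRUE
)

# Natural log compresses the wide range of populations before mapping.
zone_data$log_population <- log(zone_data$population)
print(zone_data)
```

A left join is used here so that a zone missing from the population table would show up as `NA` instead of silently disappearing from the map.
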

## Load Open Street Map Data

Apart from population, we think the location of certain city amenities might be another main factor affecting the total revenue of each zone. Hence we extract amenity information, including theatres, restaurants and hospitals, from OpenStreetMap using the osmdata package and plot it with the same package.

Because the dataset is limited, we cannot get every place where people are likely to take a taxi. But we can still see that the places where the orange dots cluster most densely are popular areas for taking a taxi; they are also likely to be in the centre of Chicago.

<img src="./assets/2.png" width="1867" />

To make the comparison easier, we plot all roads for cars, add the zone borders to the map, and draw the amenities on top of them.

<img src="./assets/3.png" width="1867" />

Now it becomes clearer that higher revenue in certain zones might be related to the density of certain amenities. The only exception is O’Hare, which has few amenities but is still a hot spot for taxis. It turns out that O’Hare is the location of Chicago O'Hare International Airport, which explains why it attracts so many taxis and produces such high revenue.

]

---

## .white[Animation]

.scroll-output[
<!-- -->
]

---

## Time Map

.scroll-output[
]

<!-- .footnote[ -->
<!-- By Yiren -->
<!-- ] -->

---

## Questions

.pull-right[
<img src="assets/Picture2.png" width="507" style="display: block; margin: auto 0 auto auto;" />
]